I’ve been using R and data analysis tools for a while already, but I, like so many digital humanists, haven’t studied these specific tools formally. This course seems like a good place to learn about the tools and methods in a more structured way, and the course programme looks very promising.
I’m familiar with the basic structures of R, so this was pretty smooth sailing. I’ve encountered some of the tidyverse stuff before, but haven’t really studied it. This introduction to it was quite enlightening, and I now have a significantly better understanding of the tidyverse structures. I can see the merits of the pipe (%>%) operator: it makes following the workflow steps very easy.
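To make that concrete, here is a small sketch of the same computation written as nested calls and as a pipeline (uses only magrittr, which dplyr re-exports):

```r
library(magrittr)  # provides %>% (also re-exported by dplyr)

# Nested calls read inside-out...
result_nested <- round(mean(sqrt(c(1, 4, 9, 16))), 2)

# ...while the pipe reads top to bottom, one step per line.
result_piped <- c(1, 4, 9, 16) %>%
  sqrt() %>%
  mean() %>%
  round(2)

identical(result_nested, result_piped)  # TRUE, both are 2.5
```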
tibble() was a new structure to me as well, and one that’s going to be useful. The promise of stricter adherence to input variable types and less unhelpful “help” compared to data frames sounds like a way to avoid a lot of headaches.
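A small sketch of the stricter behaviour, comparing a data.frame and a tibble side by side (assumes the tibble package):

```r
library(tibble)

df <- data.frame(value = 1:3, label = c("a", "b", "c"))
tb <- tibble(value = 1:3, label = c("a", "b", "c"))

# data.frame silently partial-matches column names...
df$val                 # returns the 'value' column: 1 2 3
# ...while a tibble refuses to guess: NULL plus a warning.
tb$val

# Single-bracket, single-column indexing: data.frame drops
# to a bare vector, a tibble stays a tibble.
class(df[, "value"])   # "integer"
class(tb[, "value"])   # "tbl_df" "tbl" "data.frame"
```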
The R for Health Data Science book was very easy to follow and is well structured. I’m pretty sure I’m going to use it as an R reference in the future; it is a lot clearer than much of the official R documentation.
My github repo for the course is here: https://github.com/villevaara/IODS-project
Read the students2014 data into R either from your local folder (if you completed the Data wrangling part) or from this url: https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt . (The separator is a comma “,” and the file includes a header). Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. There is information related to the data here. (0-2 points)
library(readr)
learning2014 <- read_csv("data/learning2014.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(learning2014)
## [1] 166 7
# the data has 166 rows and 7 variable columns
str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# ... the variables are as listed above: gender is of chr type, the rest are numeric.
The data comes from a survey on study methods conducted in 2014. As summarised above, the learning-approach variables are mean scores computed over a wider set of questionnaire items for each respondent.
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
p <- ggpairs(learning2014, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## `geom_smooth()` using formula = 'y ~ x'
qplot(stra, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
qplot(surf, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
The function produces a really nice overview with very little effort. I wonder whether it is that easy with data that fits the format less neatly. Some variables seem to correlate strongly with points, especially attitude. There’s also something curious going on between surface and deep learning methods: males show a strong inverse correlation there, while females do not. Female respondents are overrepresented in the sample, though. Maybe male respondents employ either deep or surface methods, while female respondents might employ both.
Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it. (0-4 points)
# fit a linear model
p_surf_model3 <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(p_surf_model3)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
p_model1 <- lm(points ~ attitude, data = learning2014)
summary(p_model1)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Neither ‘stra’ nor ‘surf’ seems to be statistically significant, at least if I got this right. The results are roughly the same with or without those two; ‘attitude’ is the sole significant explanatory variable for ‘points’.
Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model. (0-3 points)
p_att_model <- lm(points ~ attitude, data = learning2014)
summary(p_att_model)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The model based on attitude explains ~19% of the variability in points. Attitude is the strongest, and almost the sole, variable with a significant effect on points. Adding stra and surf raised multiple R-squared a hair, but adding all three made adjusted R-squared drop a bit. My understanding is that this is not helpful, as it needlessly complicates the model.
Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots. (0-3 points)
plot(p_att_model, which = c(1,2,5))
The material states that linear regression modelling has four main assumptions: linearity, independence of the errors, normality of the residuals, and constant error variance. The diagnostic plots speak to these as follows:
Residuals vs fitted - The residuals seem to scatter evenly around zero with no obvious pattern, as they should.
Normal QQ-plot - The residuals diverge from the straight line a bit at both ends. That might be alarming, but is it enough to matter? I’m not sure I’m yet qualified to answer.
Residuals vs leverage - All the observations fall within Cook’s distance, so there are no especially influential observations skewing the results.
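As a hedged complement to eyeballing the QQ-plot, a formal normality test of the residuals could be run (assuming the p_att_model object fitted above); a small p-value would indicate departure from normality:

```r
# Shapiro-Wilk test on the residuals of the attitude-only model.
shapiro.test(residuals(p_att_model))
```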
library(readr)
alc <- read_csv("data/alc.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl (1): high_use
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(alc)
## [1] 370 35
str(alc)
## spc_tbl_ [370 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ school : chr [1:370] "GP" "GP" "GP" "GP" ...
## $ sex : chr [1:370] "F" "F" "F" "F" ...
## $ age : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr [1:370] "U" "U" "U" "U" ...
## $ famsize : chr [1:370] "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr [1:370] "A" "T" "T" "T" ...
## $ Medu : num [1:370] 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : num [1:370] 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr [1:370] "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr [1:370] "teacher" "other" "other" "services" ...
## $ reason : chr [1:370] "course" "course" "other" "home" ...
## $ guardian : chr [1:370] "mother" "father" "mother" "mother" ...
## $ traveltime: num [1:370] 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num [1:370] 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr [1:370] "yes" "no" "yes" "no" ...
## $ famsup : chr [1:370] "no" "yes" "no" "yes" ...
## $ activities: chr [1:370] "no" "no" "no" "yes" ...
## $ nursery : chr [1:370] "yes" "no" "yes" "yes" ...
## $ higher : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ internet : chr [1:370] "no" "yes" "yes" "yes" ...
## $ romantic : chr [1:370] "no" "no" "no" "yes" ...
## $ famrel : num [1:370] 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num [1:370] 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num [1:370] 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num [1:370] 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num [1:370] 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num [1:370] 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : num [1:370] 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr [1:370] "no" "no" "yes" "yes" ...
## $ absences : num [1:370] 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : num [1:370] 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : num [1:370] 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : num [1:370] 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num [1:370] 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
## - attr(*, "spec")=
## .. cols(
## .. school = col_character(),
## .. sex = col_character(),
## .. age = col_double(),
## .. address = col_character(),
## .. famsize = col_character(),
## .. Pstatus = col_character(),
## .. Medu = col_double(),
## .. Fedu = col_double(),
## .. Mjob = col_character(),
## .. Fjob = col_character(),
## .. reason = col_character(),
## .. guardian = col_character(),
## .. traveltime = col_double(),
## .. studytime = col_double(),
## .. schoolsup = col_character(),
## .. famsup = col_character(),
## .. activities = col_character(),
## .. nursery = col_character(),
## .. higher = col_character(),
## .. internet = col_character(),
## .. romantic = col_character(),
## .. famrel = col_double(),
## .. freetime = col_double(),
## .. goout = col_double(),
## .. Dalc = col_double(),
## .. Walc = col_double(),
## .. health = col_double(),
## .. failures = col_double(),
## .. paid = col_character(),
## .. absences = col_double(),
## .. G1 = col_double(),
## .. G2 = col_double(),
## .. G3 = col_double(),
## .. alc_use = col_double(),
## .. high_use = col_logical()
## .. )
## - attr(*, "problems")=<externalptr>
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
# ... the variables are as listed above: 17 are of chr type, 17 numeric, and high_use is logical.
The data in ‘alc’ comes from questionnaires given to students in two Portuguese schools. It contains student background information, together with details on their school performance and alcohol consumption.
The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption. (0-1 point)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
alc_subset <- select(alc, c('alc_use', 'high_use', 'sex', 'goout', 'G3', 'absences'))
gather(alc_subset) %>% glimpse()
## Rows: 2,220
## Columns: 2
## $ key <chr> "alc_use", "alc_use", "alc_use", "alc_use", "alc_use", "alc_use"…
## $ value <chr> "1", "1", "2.5", "1", "1.5", "1.5", "1", "1", "1", "1", "1.5", "…
gather(alc_subset) %>% View()
gather(alc_subset) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
Absences and the G3 score have high counts at both ends of the scale. Absences also has a strange peak in the middle, which might be an artifact of school practices: maybe a certain number of absences either automatically disqualifies one from the class, or is the maximum permitted amount. Alcohol use is weighted towards the lower end of the scale, out-goingness is roughly normally distributed, and the male:female ratio is close to 1:1.
library(dplyr)
alc_subset %>% group_by(sex, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 3
## # Groups: sex [2]
## sex high_use count
## <chr> <lgl> <int>
## 1 F FALSE 154
## 2 F TRUE 41
## 3 M FALSE 105
## 4 M TRUE 70
Males are somewhat overrepresented in the high alcohol usage group, as hypothesised.
alc_subset %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use count mean_grade
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 11.4
## 2 F TRUE 41 11.8
## 3 M FALSE 105 12.3
## 4 M TRUE 70 10.3
High alcohol use seems to go together with somewhat lower grades in males, but not in females.
# Some of the exercise code fit with my hypothesis nicely...
g1 <- ggplot(alc_subset, aes(x = high_use, y = G3))
g1 + geom_boxplot(aes(col = sex)) + ylab("grade")
And the same can be seen from the above plot.
alc_subset %>% group_by(high_use) %>% summarise(mean_goout = mean(goout))
## # A tibble: 2 × 2
## high_use mean_goout
## <lgl> <dbl>
## 1 FALSE 2.85
## 2 TRUE 3.73
Going out seems to quite clearly correlate with high alcohol usage, as was the hypothesis.
alc_subset %>% group_by(high_use) %>% summarise(mean_absences = mean(absences))
## # A tibble: 2 × 2
## high_use mean_absences
## <lgl> <dbl>
## 1 FALSE 3.71
## 2 TRUE 6.38
And alcohol use and absences correlate.
g2 <- ggplot(alc_subset, aes(x = high_use, y = absences))
g2 + geom_boxplot(aes(col = sex)) + ylab("absences") +
ggtitle("Student absences by alcohol consumption and sex")
But gender doesn’t seem to make a huge difference here. Still, alcohol use in males does correlate with absences somewhat more clearly.
m <- glm(high_use ~ absences + goout + sex + G3, data = alc_subset, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + goout + sex + G3, family = "binomial",
## data = alc_subset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9500 -0.7990 -0.5263 0.8110 2.4703
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.58037 0.70357 -5.089 3.60e-07 ***
## absences 0.08263 0.02267 3.645 0.000268 ***
## goout 0.70551 0.12167 5.799 6.68e-09 ***
## sexM 1.01859 0.25986 3.920 8.87e-05 ***
## G3 -0.04508 0.03985 -1.131 0.258029
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 372.82 on 365 degrees of freedom
## AIC: 382.82
##
## Number of Fisher Scoring iterations: 4
absences, goout, and sex all seem strongly associated with high_use, as indicated by the low p-values, but G3 clearly is not. So, let’s drop G3 from the model:
m <- glm(high_use ~ absences + goout + sex, data = alc_subset, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + goout + sex, family = "binomial",
## data = alc_subset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8060 -0.8090 -0.5248 0.8214 2.4806
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.18142 0.48085 -8.696 < 2e-16 ***
## absences 0.08478 0.02266 3.741 0.000183 ***
## goout 0.72793 0.12057 6.038 1.56e-09 ***
## sexM 1.02223 0.25946 3.940 8.15e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 374.10 on 366 degrees of freedom
## AIC: 382.1
##
## Number of Fisher Scoring iterations: 4
And add coefficients and confidence intervals:
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp()
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.0152768 0.005706586 0.03774415
## absences 1.0884786 1.042536998 1.14083818
## goout 2.0707952 1.644519878 2.64113883
## sexM 2.7793948 1.682230346 4.66256127
The odds ratios put the effect sizes in a different light than the earlier summaries suggested. For absences the odds ratio is about 1.09, meaning each additional absence multiplies the odds of high alcohol use by roughly 1.09; the absences scale is much more granular than the other variables here, so its per-unit effect looks small even though the cumulative effect of many absences is not. Being male multiplies the odds of high use by about 2.78, and each step up in out-goingness multiplies them by about 2.07. All the confidence intervals exclude 1, which is consistent with the significance tests above.
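To double-check that reading, exponentiating a log-odds coefficient gives the multiplicative change in odds per unit increase. A small worked example with the absences coefficient from the summary above:

```r
b_absences <- 0.08478   # log-odds coefficient for absences (from the summary)

exp(b_absences)         # ~1.09: one extra absence multiplies the
                        # odds of high use by about 1.09
exp(10 * b_absences)    # ~2.33: ten extra absences more than
                        # double the odds
```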
probabilities <- predict(m, type = "response")
alc_subset <- mutate(alc_subset, probability = probabilities)
alc_subset <- mutate(alc_subset, prediction = probability > 0.5)
raw_num <- table(high_use = alc_subset$high_use, prediction = alc_subset$prediction)
raw_num
## prediction
## high_use FALSE TRUE
## FALSE 242 17
## TRUE 61 50
table(high_use = alc_subset$high_use, prediction = alc_subset$prediction) %>%
prop.table() %>%
addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65405405 0.04594595 0.70000000
## TRUE 0.16486486 0.13513514 0.30000000
## Sum 0.81891892 0.18108108 1.00000000
precision <- raw_num[2,2]/sum(raw_num[,2])
precision
## [1] 0.7462687
recall <- raw_num[2,2]/sum(raw_num[2,])
recall
## [1] 0.4504505
The model doesn’t seem that great. Precision was ~75%; only three quarters of the positive predictions were accurate. Recall was ~45%, so more than half of the actual high_use cases went undetected.
1 - (sum(alc_subset$high_use == alc_subset$prediction) / nrow(alc_subset))
## [1] 0.2108108
The total proportion of inaccurate classifications is ~21%. That’s not a great result. Alcohol consumption clearly does not explain all other behaviour, yet a surprisingly high number of predictions are correct. The relatively good figure is probably because the model errs on the side of predicting ‘not high use’, and as that is the more common outcome the accuracy is misleadingly high. Precision and recall (which were not great) are probably better indicators.
If we instead predict with a simple heuristic based on the three variables we decided were significant, labelling a student as a high consumer whenever at least two of the three indicators point that way (above-mean out-goingness, above-mean absences, male), the results look like this:
alc_subset$goout_guess <- alc_subset$goout > mean(alc_subset$goout)
alc_subset$abs_guess <- alc_subset$absences > mean(alc_subset$absences)
alc_subset$gender_guess <- alc_subset$sex == "M"
alc_subset$ultimate_guess <- (alc_subset$abs_guess + alc_subset$goout_guess + alc_subset$gender_guess) >= 2
1 - (sum(alc_subset$high_use == alc_subset$ultimate_guess) / nrow(alc_subset))
## [1] 0.2810811
We got 28% wrong, a bit higher than the model’s 21% but not dramatically so. I wouldn’t base a student alcoholism intervention programme on either.
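The ~21% training error above may be optimistic, since it is measured on the same data the model was fitted on. A hedged sketch of 10-fold cross-validation with boot::cv.glm would estimate the out-of-sample error; simulated stand-in data is used here (loosely following the fitted coefficients above) since this is only a sketch:

```r
library(boot)

# Hypothetical stand-in data for alc_subset.
set.seed(123)
n <- 370
d <- data.frame(
  absences = rpois(n, 4),
  goout    = sample(1:5, n, replace = TRUE),
  sex      = sample(c("F", "M"), n, replace = TRUE)
)
lin <- -4.2 + 0.085 * d$absences + 0.73 * d$goout + 1.02 * (d$sex == "M")
d$high_use <- rbinom(n, 1, plogis(lin)) == 1

# Mean prediction error with a 0.5 classification threshold.
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

m  <- glm(high_use ~ absences + goout + sex, data = d, family = "binomial")
cv <- cv.glm(data = d, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]  # cross-validated misclassification rate
```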
Didn’t have time to finish these this week! Less work for the reviewer, I guess. I did a few of the starting points, and will finalize these later.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The dataset contains various socio-economic data on Boston suburbs, ranging from crime rates to property values and pupil/teacher ratios in schools.
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(GGally)
cor_matrix <- cor(Boston) %>% round(2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
ggpairs(Boston)
corrplot(cor_matrix, method="circle", type="upper")
Boston %>% ggplot(aes(indus)) +
geom_histogram(binwidth = 1) +
ylab("Industry ratio")
Boston %>% ggplot(aes(rad)) +
geom_histogram(binwidth = 1) +
ylab("Radial highway access")
Boston %>% ggplot(aes(tax)) +
geom_histogram(binwidth = 20) +
ylab("Property tax rate")
Boston %>% ggplot(aes(age)) +
geom_histogram(binwidth = 10) +
ylab("Building age")
A few variables have strongly stratified distributions: some suburbs have very high values while most others have very low ones. See for example the crime rate (crim), river proximity (chas) and the proportion of people of colour among the residents (black).
The industry share (indus) seems to correlate strongly (negatively or positively) with many of the other variables in the dataset. The histograms of the correlating variables vary quite widely in shape, so there is probably quite a bit of variance between suburbs on these.
library(dplyr)
boston_scaled <- scale(Boston) %>% as.data.frame()
boston_scaled$crim <- as.numeric(boston_scaled$crim)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), include.lowest = TRUE,
labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- crime
# Create test and train datasets
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
All the variables were standardized with the scale() function. The “crim” (crime rate) variable was then overwritten with a categorical version, cut at the quantiles of the standardized values.
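A quick sanity check of what scale() does: each column of the result should have mean approximately 0 and standard deviation 1 (Boston ships with the MASS package):

```r
library(MASS)

boston_chk <- as.data.frame(scale(Boston))

# Means are ~0 up to floating point noise, sds are 1.
range(colMeans(boston_chk))
all(abs(apply(boston_chk, 2, sd) - 1) < 1e-8)  # TRUE
```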
lda.fit <- lda(crim ~ ., data = train)
lda.fit
## Call:
## lda(crim ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2623762 0.2450495 0.2475248 0.2450495
##
## Group means:
## zn indus chas nox rm age
## low 0.9889873 -0.9152790 -0.08661679 -0.8936698 0.39998373 -0.9100671
## med_low -0.1338286 -0.3098923 -0.03371693 -0.5312822 -0.17019779 -0.3156434
## med_high -0.3749021 0.2270614 0.12138096 0.4486191 0.04113992 0.4391332
## high -0.4872402 1.0149946 -0.07348562 1.0569478 -0.40713301 0.8111582
## dis rad tax ptratio black lstat
## low 0.9818660 -0.6893371 -0.6993106 -0.4404946 0.37302855 -0.74597110
## med_low 0.3306866 -0.5491659 -0.4738026 0.0200727 0.32400075 -0.10411159
## med_high -0.4113998 -0.4202708 -0.3002731 -0.3766994 0.06393206 0.01483488
## high -0.8441771 1.6596029 1.5294129 0.8057784 -0.77801174 0.84952814
## medv
## low 0.47125430
## med_low -0.04650984
## med_high 0.15452621
## high -0.65528732
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 1.377712e-01 0.562724358 -0.94677092
## indus 1.468012e-02 -0.351054070 0.07196873
## chas -8.144627e-05 0.042836590 0.09578875
## nox 3.253429e-01 -0.761186294 -1.21514962
## rm 3.410481e-02 0.002352645 -0.16283911
## age 2.838052e-01 -0.303337780 -0.09867693
## dis -1.166495e-01 -0.130041749 0.06757985
## rad 3.380960e+00 0.800767847 -0.16460783
## tax 3.019564e-04 0.234660291 0.54279821
## ptratio 1.578711e-01 0.076314641 -0.06222108
## black -1.422233e-01 0.002439578 0.15281133
## lstat 1.554836e-01 -0.208230252 0.37187820
## medv 6.818721e-02 -0.370854142 -0.16948090
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9494 0.0392 0.0114
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crim)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
test_crime <- test$crim
test_lda <- dplyr::select(test, -crim)
lda.pred <- predict(lda.fit, newdata = test_lda)
table(correct = test$crim, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 11 9 1 0
## med_low 8 15 4 0
## med_high 0 11 14 1
## high 0 0 1 27
Looks like the ‘low’ and ‘med_low’ categories got mixed predictions, and ‘med_high’ was fairly often predicted as ‘med_low’, while the ‘high’ category was predicted almost perfectly. The test sample is quite small, and the actual values of the categorized variable might sit close to the quantile cut points in the low/med_low area. Or the explanatory variables might not cover all the actual drivers of crime rates. It’s very unlikely that they would.
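For a single summary number, overall accuracy can be read off the diagonal of the confusion matrix (assuming the test and lda.pred objects from above):

```r
cross_tab <- table(correct = test$crim, predicted = lda.pred$class)
sum(diag(cross_tab)) / sum(cross_tab)  # share of correct predictions
```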
boston_scaled <- scale(Boston) %>% as.data.frame()
Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)
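I didn’t get to this part, but a minimal sketch of the steps described above might look like the following (assumes MASS for the Boston data; the “elbow” where the total within-cluster sum of squares stops dropping sharply suggests a number of clusters):

```r
library(MASS)

boston_scaled <- as.data.frame(scale(Boston))

# Total within-cluster sum of squares for k = 1..10.
set.seed(42)  # k-means uses random starting centers
twcss <- sapply(1:10, function(k) {
  kmeans(boston_scaled, centers = k)$tot.withinss
})
plot(1:10, twcss, type = "b", xlab = "k", ylab = "Total WCSS")

# Refit with the chosen k (2 is a common choice for this data) and
# colour a few variable pairs by cluster.
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[, c("crim", "rad", "tax", "medv")], col = km$cluster)
```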
human <- read.table(file = "data/human.csv", row.names = 1, sep = ",", header = TRUE)
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
library(tidyr)
library(corrplot)
library(ggplot2)
library(GGally)
cor_matrix <- cor(human) %>% round(2)
cor_matrix
## Edu2.FM Labo.FM Edu.Exp Life.Exp GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM 1.00 0.01 0.59 0.58 0.43 -0.66 -0.53 0.08
## Labo.FM 0.01 1.00 0.05 -0.14 -0.02 0.24 0.12 0.25
## Edu.Exp 0.59 0.05 1.00 0.79 0.62 -0.74 -0.70 0.21
## Life.Exp 0.58 -0.14 0.79 1.00 0.63 -0.86 -0.73 0.17
## GNI 0.43 -0.02 0.62 0.63 1.00 -0.50 -0.56 0.09
## Mat.Mor -0.66 0.24 -0.74 -0.86 -0.50 1.00 0.76 -0.09
## Ado.Birth -0.53 0.12 -0.70 -0.73 -0.56 0.76 1.00 -0.07
## Parli.F 0.08 0.25 0.21 0.17 0.09 -0.09 -0.07 1.00
ggpairs(human)
corrplot(cor_matrix, method="circle", type="upper")
There’s quite strong correlation between many of the variables in the data. Distributions vary, but often the shapes of the distribution historgrams are similar- for example between Mat.Mor and GNI, or between Labo.FM, Edu.Exp and Life.Exp.
For refenrece, the data description from https://github.com/KimmoVehkalahti/Helsinki-Open-Data-Science/blob/master/datasets/human_meta.txt :
“GNI” = Gross National Income per capita “Life.Exp” = Life expectancy at birth “Edu.Exp” = Expected years of schooling “Mat.Mor” = Maternal mortality ratio “Ado.Birth” = Adolescent birth rate
“Parli.F” = Percetange of female representatives in parliament “Edu2.F” = Proportion of females with at least secondary education “Edu2.M” = Proportion of males with at least secondary education “Labo.F” = Proportion of females in the labour force “Labo.M” ” Proportion of males in the labour force
“Edu2.FM” = Edu2.F / Edu2.M “Labo.FM” = Labo2.F / Labo2.M
Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03 0.0022190154
## Edu.Exp -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02 0.1431180282
## Life.Exp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02 0.9865644425
## GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05 -0.0001135863
## Mat.Mor 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03 0.0266373214
## Ado.Birth 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03 0.0188618600
## Parli.F -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
## PC6 PC7 PC8
## Edu2.FM 2.180183e-02 6.998623e-01 7.139410e-01
## Labo.FM 3.264423e-02 7.132267e-01 -7.001533e-01
## Edu.Exp 9.882477e-01 -3.826887e-02 7.776451e-03
## Life.Exp -1.453515e-01 5.380452e-03 2.281723e-03
## GNI -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor 1.695203e-03 1.355518e-04 8.371934e-04
## Ado.Birth 1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F -2.309896e-02 -2.642548e-03 2.680113e-03
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
biplot(pca_human, choices = 1:2, col = c("grey40", "deeppink2"), cex = c(0.8, 1))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
human_std <- scale(human)
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.35664370 0.03796058 -0.24223089 -0.62678110 0.5983585
## Labo.FM 0.05457785 0.72432726 -0.58428770 -0.06199424 -0.2625067
## Edu.Exp -0.42766720 0.13940571 -0.07340270 0.07020294 -0.1659678
## Life.Exp -0.44372240 -0.02530473 0.10991305 0.05834819 -0.1628935
## GNI -0.35048295 0.05060876 -0.20168779 0.72727675 0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 0.25170614 0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 -0.04986763 0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 -0.01396293 0.1523699
## PC6 PC7 PC8
## Edu2.FM 0.17713316 0.05773644 0.16459453
## Labo.FM -0.03500707 -0.22729927 -0.07304568
## Edu.Exp -0.38606919 0.77962966 -0.05415984
## Life.Exp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## Mat.Mor 0.17370039 0.35380306 0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F 0.13749772 0.00568387 -0.02306476
summary(pca_human_std)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
biplot(pca_human_std, choices = 1:2, col = c("grey40", "deeppink2"))
Standardizing the variables brought their properties in the PCA much closer together. The plot of the non-standardized PCA is almost unreadable- the only variable that stands out is GNI, and that should follow from the fact that the absolute values there are way higher than in the other variables.
In the standardized version, few clear directions are discernible. Maternal mortality ratio and Adolescent birth rate clearly have similar pull, both related to births. The gender balance in labour and percentage of female representatives in parliament also have an effect in the same direction. Rest of the variables (GNI, Life expectancy at birth, Expected years of schooling and gender balance in secondary education) all form a third group.
PC2 creates an axis, along which the variables connected with birth (maternal mortality and adolescence birth rate) and the other group connected with more general metrics for quality of life (GNI, Life expectancy at birth, Expected years of schooling) and gender balance in secondary education form opposite poles. This makes sense, as the first group would strongly correlate with large portion of the population living in poor conditions, and high GNI, etc would on the other hand correlate with relatively properous societies.
PC1 is more connected with gender equality, and does not seem to directly correlate with high GNI, but there the two variables (gender balance in parliament and in work force) are less strongly correlated than in the other groupings. This might make sense- you could have a relatively conservative and low income society where women are employed in relatively low status jobs, but required to work still as the total income of the economic unit would otherwise be too low. Parliamentary representation on the other hand correlates more with general indicators of wealth (GNI, etc.), and this would indicate that a more equal societies in terms of power relations are more often also more prosperous.
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
dim(tea)
## [1] 300 36
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
ggpairs(tea[,c("age", "how", "where")])
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(gather(tea), aes(value)) +
geom_histogram(stat="count") +
facet_wrap(~key, scales = 'free_x')
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
Did have a quick look at some selected correlations with ggpairs. Also
histograms of all the columns.
tea_subset <- tea[,c("spirituality", "healthy", "friendliness", "relaxing", "sophisticated", "exciting", "feminine")]
tea_subset$healthy <- relevel(tea_subset$healthy, "Not.healthy")
tea_subset$friendliness <- relevel(tea_subset$friendliness, "Not.friendliness")
tea_subset$exciting <- relevel(tea_subset$exciting, "No.exciting")
tea_subset$feminine <- relevel(tea_subset$feminine, "Not.feminine")
library(FactoMineR)
mca <- MCA(tea_subset, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_subset, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.208 0.185 0.154 0.135 0.120 0.109 0.089
## % of var. 20.786 18.492 15.444 13.509 12.033 10.880 8.856
## Cumulative % of var. 20.786 39.279 54.723 68.232 80.265 91.144 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.996 1.591 0.653 | -0.211 0.080 0.029 | -0.082 0.014
## 2 | 0.951 1.449 0.546 | 0.333 0.200 0.067 | 0.068 0.010
## 3 | 0.375 0.225 0.174 | -0.600 0.649 0.446 | -0.219 0.103
## 4 | 0.034 0.002 0.001 | -0.468 0.395 0.167 | 0.837 1.511
## 5 | 0.286 0.131 0.062 | -0.314 0.178 0.074 | 0.539 0.628
## 6 | 0.781 0.978 0.446 | -0.686 0.849 0.344 | 0.183 0.072
## 7 | 0.781 0.978 0.446 | -0.686 0.849 0.344 | 0.183 0.072
## 8 | -0.458 0.336 0.360 | -0.384 0.266 0.253 | -0.400 0.345
## 9 | 0.545 0.477 0.223 | -0.284 0.146 0.061 | 0.464 0.465
## 10 | 0.375 0.225 0.174 | -0.600 0.649 0.446 | -0.219 0.103
## cos2
## 1 0.004 |
## 2 0.003 |
## 3 0.059 |
## 4 0.534 |
## 5 0.219 |
## 6 0.024 |
## 7 0.024 |
## 8 0.274 |
## 9 0.161 |
## 10 0.059 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## Not.spirituality | 0.300 4.245 0.197 7.677 | -0.048 0.123 0.005
## spirituality | -0.657 9.303 0.197 -7.677 | 0.105 0.269 0.005
## Not.healthy | 0.472 4.588 0.095 5.340 | 0.495 5.668 0.105
## healthy | -0.202 1.966 0.095 -5.340 | -0.212 2.429 0.105
## Not.friendliness | 1.045 14.518 0.262 8.849 | -0.210 0.658 0.011
## friendliness | -0.251 3.480 0.262 -8.849 | 0.050 0.158 0.011
## No.relaxing | 0.428 4.731 0.110 5.747 | 0.892 23.133 0.480
## relaxing | -0.258 2.859 0.110 -5.747 | -0.539 13.979 0.480
## Not.sophisticated | 1.022 20.330 0.413 11.109 | -0.361 2.855 0.052
## sophisticated | -0.404 8.037 0.413 -11.109 | 0.143 1.129 0.052
## v.test Dim.3 ctr cos2 v.test
## Not.spirituality -1.231 | -0.487 15.079 0.520 -12.472 |
## spirituality 1.231 | 1.068 33.046 0.520 12.472 |
## Not.healthy 5.598 | 0.371 3.813 0.059 4.196 |
## healthy -5.598 | -0.159 1.634 0.059 -4.196 |
## Not.friendliness -1.777 | 0.890 14.168 0.190 7.535 |
## friendliness 1.777 | -0.213 3.396 0.190 -7.535 |
## No.relaxing 11.985 | -0.453 7.160 0.124 -6.093 |
## relaxing -11.985 | 0.274 4.326 0.124 6.093 |
## Not.sophisticated -3.926 | -0.175 0.806 0.012 -1.906 |
## sophisticated 3.926 | 0.069 0.319 0.012 1.906 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## spirituality | 0.197 0.005 0.520 |
## healthy | 0.095 0.105 0.059 |
## friendliness | 0.262 0.011 0.190 |
## relaxing | 0.110 0.480 0.124 |
## sophisticated | 0.413 0.052 0.012 |
## exciting | 0.005 0.637 0.040 |
## feminine | 0.373 0.005 0.135 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic")
Looks like Dim2 reveals, that there’s a tendency among the respondents
to either assign multiple attributes to tea, or not assign any. Dim 1
reveals that exciting and ‘not relaxing’ are paired, and the other way
round. Rest of the attributes have less strong associations with other
attributes, but there are certain groupings: sophistication, femininess
and spiritulaity are grouping close to each other, as well as
not-friendliness and unsophistication.
Couldn’t finish these in time. Apologies to whoever is checking this.
RATS <- read.table(file = "data/rats.csv", sep = ",", header = TRUE)
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
library(ggplot2)
# Draw the plot
ggplot(RATS, aes(x = Time, y = weight, color = ID, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATS$weight), max(RATS$weight)))
Might have made more sense to have them all in one graph, with linetype
= Group instead of ID.
library(dplyr)
library(tidyr)
RATS <- RATS %>%
group_by(Time) %>%
mutate(stdweight = (weight - mean(weight))/ sd(weight)) %>%
ungroup()
# Plot again with the standardized
ggplot(RATS, aes(x = Time, y = stdweight, color = ID, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:16, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized bprs")
Standardized version.
n <- 16
# Summary data with mean and standard error of weight by treatment and week
RATSS <- RATS %>%
group_by(Group, Time) %>%
summarise(mean = mean(weight), se = sd(weight)/n ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Plot the mean profiles
library(ggplot2)
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.9,0.5)) +
scale_y_continuous(name = "mean(weight) +/- se(weight)")
And mean weights:
RATS8S <- RATS %>%
group_by(Group, ID) %>%
summarise(mean=mean(weight)) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
library(ggplot2)
ggplot(RATS8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(weight)")
Had to filter outliers from each Group of rats separately.
RATS8S_filtered <- RATS8S %>% filter(!(Group == 1 & mean < 250) &
!(Group == 2 & mean > 550) &
!(Group == 3 & mean < 500))
ggplot(RATS8S_filtered, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(bprs), weeks 1-8")
… and the groupings are very tight as a results of filtering. I wonder if it did really make sense?
t-test only makes sense pairwise, so we’ll filter for groups. Could make a t-test of group 1 vs group 3 too, but that’d show even more obvious significance than these tests.
RATS12 <- RATS8S_filtered %>% filter(Group == 1 | Group == 2)
t.test(mean ~ Group, data = RATS12, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -41.656, df = 8, p-value = 1.215e-10
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -192.089 -171.937
## sample estimates:
## mean in group 1 mean in group 2
## 267.4416 449.4545
RATS23 <- RATS8S_filtered %>% filter(Group == 2 | Group == 3)
t.test(mean ~ Group, data = RATS23, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -18.428, df = 4, p-value = 5.102e-05
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
## -100.45660 -74.14946
## sample estimates:
## mean in group 2 mean in group 3
## 449.4545 536.7576
# Add the baseline from the original data as a new variable to the summary data
RATS_bl <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS8S_non_filtered <- RATS8S %>% mutate(baseline = RATS_bl$WD1)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = RATS8S_non_filtered)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 252125 252125 2237.0655 5.217e-15 ***
## Group 2 726 363 3.2219 0.07586 .
## Residuals 12 1352 113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looks like Group matters, as was evident from the graphs too… but it was practically all determined by the baseline. The p-value of 0.07 for ‘Group’ barely registers as signifying at all.
Morning sprint, let’s see how far I can get …
BPRS <- read.table(file = "data/bprs.csv", sep = ",", header = TRUE)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
ggplot(BPRS, aes(x = week, y = bprs, group = subject, colour = subject)) +
geom_line(aes(linetype = subject)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "BPRS") +
theme(legend.position = "top") +
facet_grid(. ~ treatment, labeller = label_both) +
scale_linetype_manual(values = rep(1:10, times=4))
BPRS_reg <- lm(bprs ~ week + treatment, BPRS)
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Based on the p-values, looks like week seems very significant here, and treatment makes no difference. An argument for the benefits of institutionalization no doubt.
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
I have to confess I fell off the cart / didn’t have time to properly study the materials for the interpretation of this one.
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
anova(BPRS_ref1, BPRS_ref)
## Data: BPRS
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And the two models compared…